Independent Range Sampling, Revisited
In the independent range sampling (IRS) problem, given an input set P of n points in R^d, the task is to build a data structure such that, given a range R and an integer t >= 1, it returns t points drawn uniformly and independently from P ∩ R. The samples must satisfy inter-query independence; that is, the samples returned by every query must be independent of the samples returned by all previous queries. This problem was first tackled by Hu, Qiao and Tao in 2014, who proposed optimal structures for the one-dimensional dynamic IRS problem in internal memory and the one-dimensional static IRS problem in external memory.
In this paper, we study two natural extensions of the independent range sampling problem. In the first extension, we consider the static IRS problem in two and three dimensions in internal memory. We obtain data structures with optimal space-query tradeoffs for 3D halfspace, 3D dominance, and 2D three-sided queries. The second extension considers the weighted IRS problem: each point is associated with a real-valued weight, and given a query range R, a sample is drawn independently such that each point in P ∩ R is selected with probability proportional to its weight. Walker's alias method is a classic solution to this problem when no query range is specified. We obtain an optimal data structure for the one-dimensional weighted range sampling problem, thereby extending the alias method to support range queries.
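As context for the alias method mentioned above, here is a minimal sketch of Walker's construction without range support: it preprocesses n weights in O(n) time into a probability table and an alias table, after which each weighted sample takes O(1) time. The function names and plain-Python representation are illustrative, not taken from the paper.

```python
import random

def build_alias_table(weights):
    """Preprocess positive weights into Walker's alias table in O(n) time."""
    n = len(weights)
    total = sum(weights)
    # Scale weights so the average table cell carries probability 1.
    scaled = [w * n / total for w in weights]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        # Cell s keeps probability scaled[s]; the rest aliases to l.
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:  # leftovers equal 1 up to float error
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias, rng=random):
    """Draw one index in O(1) time, with probability proportional to its weight."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

The construction works because each table cell i returns i with probability prob[i] and its alias otherwise, so each index's total mass equals its normalized weight.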
The Space Complexity of 2-Dimensional Approximate Range Counting
We study the problem of 2-dimensional orthogonal range counting with additive error. Given a set P of n points drawn from an n × n grid and an error parameter ε, the goal is to build a data structure such that for any orthogonal range R, it can return the number of points in P ∩ R with additive error εn. A well-known solution for this problem is the {\em ε-approximation}, which is a subset A ⊆ P that can estimate the number of points in P ∩ R with the number of points in A ∩ R. It is known that an ε-approximation of size O((1/ε) log^{2.5}(1/ε)) exists for any P with respect to orthogonal ranges, and the best lower bound is Ω((1/ε) log(1/ε)). The ε-approximation is a rather restricted data structure, as we are not allowed to store any information other than the coordinates of the points in A. In this paper, we explore what can be achieved without any restriction on the data structure. We first describe a simple data structure that uses O((1/ε)(log^2(1/ε) + log n)) bits and answers queries with error εn. We then prove a lower bound that any data structure that answers queries with error εn must use Ω((1/ε)(log^2(1/ε) + log n)) bits. Our lower bound is information-theoretic: we show that there is a large collection of point sets with large {\em union combinatorial discrepancy}, and thus they are hard to distinguish unless we use Ω((1/ε)(log^2(1/ε) + log n)) bits. Comment: 19 pages, 5 figures.
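To make the estimation idea concrete, the sketch below uses a uniform random sample in place of a true ε-approximation: a random sample needs roughly O(1/ε^2) points to guarantee additive error εn, far more than the near-(1/ε) low-discrepancy constructions the abstract cites, but the query-side estimator |A ∩ R| · n/|A| is the same. All names are hypothetical.

```python
import random

def build_approximation(points, m, seed=0):
    """A crude stand-in for an eps-approximation: a uniform random sample
    of m points. (True eps-approximations are built from low-discrepancy
    constructions and are much smaller for the same error.)"""
    rng = random.Random(seed)
    return rng.sample(points, m)

def approx_count(n, A, rect):
    """Estimate |P ∩ rect| as (n / |A|) * |A ∩ rect| for an axis-aligned
    query rect = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = rect
    hits = sum(1 for (x, y) in A if x1 <= x <= x2 and y1 <= y <= y2)
    return n * hits / len(A)
```

Note that the structure stores only point coordinates, which is exactly the restriction the paper removes by allowing arbitrary bit encodings.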
Exact Single-Source SimRank Computation on Large Graphs
SimRank is a popular measure for evaluating node-to-node similarities based on the graph topology. In recent years, single-source and top-k SimRank queries have received increasing attention due to their applications in web mining, social network analysis, and spam detection. However, a fundamental obstacle in studying SimRank has been the lack of ground truths. The only exact algorithm, the Power Method, is computationally infeasible on graphs beyond a modest number of nodes. Consequently, no existing work has evaluated the actual trade-offs between query time and accuracy on large real-world graphs. In this paper, we present ExactSim, the first algorithm that computes exact single-source and top-k SimRank results on large graphs. With high probability, this algorithm produces ground truths with a rigorous theoretical guarantee. We conduct extensive experiments on real-world datasets to demonstrate the efficiency of ExactSim. The results show that ExactSim provides the ground truth for any single-source SimRank query with a precision of up to 7 decimal places within a reasonable query time. Comment: ACM SIGMOD 202
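For reference, the Power Method mentioned above iterates the SimRank recurrence s(a,b) = c/(|I(a)||I(b)|) · Σ s(u,v) over in-neighbor pairs, with s(a,a) = 1. The dense all-pairs sketch below shows why it is infeasible at scale: each iteration touches every node pair. This is a textbook rendition, not ExactSim's algorithm.

```python
def simrank_power_method(in_neighbors, c=0.6, iters=20):
    """All-pairs SimRank via the Power Method.
    in_neighbors[v] lists the in-neighbors of node v.
    Cost per iteration is quadratic in n (times in-degree products),
    which is exactly why this baseline fails on large graphs."""
    n = len(in_neighbors)
    S = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        T = [[0.0] * n for _ in range(n)]
        for a in range(n):
            for b in range(n):
                if a == b:
                    T[a][b] = 1.0
                    continue
                Ia, Ib = in_neighbors[a], in_neighbors[b]
                if Ia and Ib:
                    T[a][b] = c * sum(S[u][v] for u in Ia for v in Ib) \
                              / (len(Ia) * len(Ib))
        S = T
    return S
```

On the toy graph with edges 2→0 and 2→1, nodes 0 and 1 share their only in-neighbor, so their SimRank converges to exactly c.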
Optimal Dynamic Subset Sampling: Theory and Applications
We study the fundamental problem of sampling independent events, called subset sampling. Specifically, consider a set S of n events, where each event i has an associated probability p(i). The subset sampling problem aims to sample a subset T ⊆ S such that every event i is independently included in T with probability p(i). A naive solution is to flip a coin for each event, which takes O(n) time. However, the specific goal is to develop data structures that allow drawing a sample in time proportional to the expected output size μ = Σ_i p(i), which can be significantly smaller than n in many applications. The subset sampling problem serves as an important building block in many tasks and has been the subject of various research efforts for more than a decade. However, most existing subset sampling approaches work in a static setting, where the events and their associated probabilities in S are not allowed to change over time. These algorithms incur either large query time or large update time in a dynamic setting, despite the ubiquity of time-evolving events with changing probabilities in real life. Therefore, it is a pressing need, but still an open problem, to design efficient dynamic subset sampling algorithms. In this paper, we propose ODSS, the first optimal dynamic subset sampling algorithm. The expected query time and update time of ODSS are both optimal, matching the lower bounds of the subset sampling problem. We present a nontrivial theoretical analysis to demonstrate the superiority of ODSS. We also conduct comprehensive experiments to empirically evaluate the performance of ODSS. Moreover, we apply ODSS to a concrete application: influence maximization. We empirically show that ODSS can improve the complexities of existing influence maximization algorithms on large real-world evolving social networks. Comment: ACM SIGKDD 202
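The gap between O(n) and O(1 + μ) can be seen in two short sketches: the naive coin-flip loop, and the classic geometric-skip trick for the special case where all events share one probability p, which jumps directly from one included index to the next. The skip trick is a standard building block for subset samplers generally, not ODSS itself; function names are illustrative.

```python
import math
import random

def subset_sample_naive(probs, rng=random):
    """Naive subset sampling: one coin flip per event, O(n) time."""
    return [i for i, p in enumerate(probs) if rng.random() < p]

def subset_sample_uniform(n, p, rng=random):
    """Special case: n events, all with probability p. Jump between
    successive included indices with geometric skips, so the expected
    time is O(1 + n*p) rather than O(n)."""
    out, i = [], -1
    if p <= 0.0:
        return out
    if p >= 1.0:
        return list(range(n))
    while True:
        u = 1.0 - rng.random()  # u in (0, 1], safe for log
        # floor(log(u)/log(1-p)) ~ number of failures before next success
        i += 1 + math.floor(math.log(u) / math.log(1.0 - p))
        if i >= n:
            return out
        out.append(i)
```

Handling heterogeneous, changing probabilities with both optimal query and update time is the part the abstract identifies as the open problem ODSS settles.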
PRSim: Sublinear Time SimRank Computation on Large Power-Law Graphs
{\it SimRank} is a classic measure of the similarities of nodes in a graph. Given a node u in a graph G, a {\em single-source SimRank query} returns the SimRank similarities between node u and each node v in G. This type of query has numerous applications in web search and social network analysis, such as link prediction, web mining, and spam detection. Existing methods for single-source SimRank queries, however, incur query cost at least linear in the number of nodes n, which renders them inapplicable for real-time and interactive analysis.
This paper proposes PRSim, an algorithm that exploits the structure of graphs to efficiently answer single-source SimRank queries. PRSim uses an index of size O(m), where m is the number of edges in the graph, and guarantees a query time that depends on the {\em reverse PageRank} distribution of the input graph. In particular, we prove that PRSim runs in sub-linear time if the degree distribution of the input graph follows a power-law distribution, a property possessed by many real-world graphs. Based on the theoretical analysis, we show that the empirical query time of all existing SimRank algorithms also depends on the reverse PageRank distribution of the graph. Finally, we present the first experimental study that evaluates the absolute errors of various SimRank algorithms on large graphs, and we show that PRSim outperforms the state of the art in terms of query time, accuracy, index size, and scalability. Comment: ACM SIGMOD 201
Optimal algorithms for selecting top-k combinations of attributes : theory and applications
Traditional top-k algorithms, e.g., TA and NRA, have been successfully applied in many areas such as information retrieval, data mining, and databases. They are designed to discover k objects, e.g., top-k restaurants, with the highest overall scores aggregated from different attributes, e.g., price and location. However, new emerging applications like query recommendation require providing the best combinations of attributes, instead of objects. The straightforward extension of existing top-k algorithms is prohibitively expensive for answering top-k combinations, because it must enumerate all possible combinations, whose number is exponential in the number of attributes. In this article, we formalize a novel type of top-k query, called top-k,m, which aims to find the top-k combinations of attributes based on the overall scores of the top-m objects within each combination, where m is the number of objects forming a combination. We propose a family of efficient top-k,m algorithms with different data access methods, i.e., sorted accesses and random accesses, and different query certainties, i.e., exact query processing and approximate query processing. Theoretically, we prove that our algorithms are instance optimal and analyze the bound on the depth of accesses. We further develop optimizations for efficient query evaluation to reduce the computational and memory costs and the number of accesses. We provide a case study on the real application of top-k,m queries in an online biomedical search engine. Finally, we perform comprehensive experiments to demonstrate the scalability and efficiency of top-k,m algorithms on multiple real-life datasets. Peer reviewed.
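For readers unfamiliar with the baseline the abstract extends, here is a compact sketch of Fagin-style TA for classic top-k objects: round-robin sorted access down per-attribute lists, random access to complete each newly seen object's score, and early termination once the k-th best score meets the threshold formed from the last-seen values. This illustrates plain TA, not the top-k,m algorithms of the article; data layout and names are illustrative.

```python
import heapq

def threshold_algorithm(scores, k):
    """TA with sum aggregation. scores maps object -> tuple of attribute
    scores. Returns the top-k (score, object) pairs, best first."""
    objs = list(scores)
    m = len(next(iter(scores.values())))
    # One access list per attribute, sorted by that attribute, descending.
    lists = [sorted(objs, key=lambda o: scores[o][a], reverse=True)
             for a in range(m)]
    seen, topk = set(), []  # topk is a min-heap of (score, obj)
    for depth in range(len(objs)):
        for a in range(m):
            o = lists[a][depth]       # sorted access
            if o not in seen:
                seen.add(o)
                heapq.heappush(topk, (sum(scores[o]), o))  # random access
                if len(topk) > k:
                    heapq.heappop(topk)
        # Threshold: best possible score of any not-yet-seen object.
        threshold = sum(scores[lists[a][depth]][a] for a in range(m))
        if len(topk) == k and topk[0][0] >= threshold:
            break
    return sorted(topk, reverse=True)
```

The early stop is what makes TA instance optimal for objects; the article's point is that naively running such an algorithm per attribute combination blows up exponentially.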
On Range Summary Queries
We study the query version of the approximate heavy hitter and quantile problems. In the former problem, the input is a parameter ε and a set P of n points in ℝ^d where each point is assigned a color from a set C, and the goal is to build a structure such that given any geometric range γ, we can efficiently find a list of approximate heavy hitters in γ ∩ P, i.e., colors that appear at least ε|γ ∩ P| times in γ ∩ P, as well as their frequencies with an additive error of ε|γ ∩ P|. In the latter problem, each point is assigned a weight from a totally ordered universe and the query must output a sequence S of 1 + 1/ε weights such that the i-th weight in S has approximate rank iε|γ ∩ P|, meaning, rank iε|γ ∩ P| up to an additive error of ε|γ ∩ P|. Previously, optimal results were only known in 1D [Wei and Yi, 2011], but a few sub-optimal methods were available in higher dimensions [Peyman Afshani and Zhewei Wei, 2017; Pankaj K. Agarwal et al., 2012].
We study the problems for two important classes of geometric ranges: 3D halfspace and 3D dominance queries. It is known that many other important queries can be reduced to these two, e.g., 1D interval stabbing or interval containment, 2D three-sided queries, 2D circular as well as 2D k-nearest neighbors queries. We consider the real RAM model of computation where integer registers of size w bits, w = Θ(log n), are also available. For dominance queries, we show optimal solutions for both heavy hitter and quantile problems: using linear space, we can answer both queries in time O(log n + 1/ε). Note that as the output size is 1/ε, after investing the initial O(log n) searching time, our structure takes on average O(1) time to find a heavy hitter or a quantile! For more general halfspace heavy hitter queries, the same optimal query time can be achieved by increasing the space by an extra log_w(1/ε) (resp. log log_w(1/ε)) factor in 3D (resp. 2D). By spending extra log^{O(1)}(1/ε) factors in both time and space, we can also support quantile queries.
We remark that it is hopeless to achieve a similar query bound for dimensions 4 or higher unless significant advances are made on the data structure side of the theory of geometric approximations.
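The additive-error frequency guarantee described above is the same one the classic Misra-Gries summary provides over a single multiset: with k counters, every reported frequency undercounts by at most N/(k+1), so k = ⌈1/ε⌉ counters suffice for error εN. The sketch below is that streaming summary, shown only to make the guarantee concrete; it is not the paper's range data structure.

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k counters over a stream of N items.
    Every returned count undercounts the true frequency by at most
    N/(k+1), so any item occurring more than N/(k+1) times survives."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Table full: decrement everything, dropping zeros
            # (this charges the new item against k existing counters).
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters
```

The query-version problem is harder precisely because such a summary must be answerable for every geometric range γ, not one fixed multiset.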